{========================================================================}
{ File Name : 'GetURL.doc', 21-Jan-95 }
{========================================================================}
-- Script to download HTML systems across the network --
{========================================================================}
{ Contents }
{========================================================================}
1. Introduction
2. Installation
3. Use
3.1. Basic Use
3.2. Options
3.3. Help & Setup Options
3.4. Control Options
3.5. Restriction Options
3.6. Input & Output Options
4. I Robot
5. Examples
6. Match
7. New Versions
8. Known Bugs & Problems
9. Glossary
10. Changes
11. Credits
12. Contact
{========================================================================}
{ Introduction }
{========================================================================}
GetURL.rexx is an ARexx script which downloads World-Wide Web pages.
With a simple command line it will download a single specific page; with
more complex command lines it can download whole sets of documents
spanning many pages.
The intention was to create a tool that allows local caching of important
web pages, with a flexible way of specifying which pages are important.
The script has no GUI as yet but may have one at some stage in the future.
If you have ever tried to download and save to disc a 200 page document
using Mosaic, then you know what this script is for. Mosaic will only
let you load a page, then save it to disc, then load another page, and
so on. This is a very frustrating process. GetURL automates it and
will run in batch mode without user intervention.
The major features of GetURL.rexx are as follows:
* doesn't require AMosaic, so you can be browsing something else
with AMosaic whilst this is running
* save pages to your hard disc so that they can be read offline and
you can also give them to friends on a floppy disc. Who knows,
you may even be able to sell discs containing web pages :-)
* flexible set of command line switches that allow you to restrict the
type of pages that it downloads
* ability to specify files for the lists of URLs that it keeps so
that any search for pages can be stopped and restarted at a later
date. i.e. you could run GetURL for 2 hours a day whilst you are
online and gradually download everything in the entire universe
and it won't repeat itself.
* includes the ability to download itself when there are new versions.
* will use a proxy if you have access to one, in order to both speed up
access to pages and also to reduce network load.
* will download binary files (*.gif, *.lha) as easily as text and html
files.
* documentation is in the top of the script file.
{========================================================================}
{ Installation }
{========================================================================}
Just copy the file GetURL.rexx to your REXX: directory.
You should also add an assign for Mosaic:
e.g.
assign Mosaic: PD:Mosaic/
TimeZone
========
If you want to use the -IfModified flag (which is **VERY** useful)
then you should also configure the TimeZone.
Use a normal text editor, find the line which looks like
gv.timezone = ''
and enter your TimeZone expressed as a difference to Greenwich Mean Time
(England Time) e.g. I am in Melbourne so my TimeZone is GMT+1100
so I put
gv.timezone = '+1100'
If I were in Adelaide I would be in the TimeZone GMT+1030
gv.timezone = '+1030'
Note: Anywhere in the USA is going to be GMT-???? so make sure
you get it right.
- If you are in England then put +0000
- Don't put symbols like EST or whatever, put it numerically.
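For example, someone in the US Eastern Standard Time zone (five hours
behind GMT) would put
gv.timezone = '-0500'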
Match
=====
Although they are not necessary, GetURL will perform better if 'Match'
and 'rexxarplib.library' are present.
Match should be in your search path. The simplest way to do this is
to copy it to your C: directory.
RexxArpLib.library is available somewhere on AmiNet.
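For example, assuming you have downloaded both files into the current
directory (on the Amiga, libraries normally live in LIBS:):
copy Match C:
copy rexxarplib.library LIBS: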
{========================================================================}
{ Use }
{========================================================================}
Basic Use
=========
The basic use of GetURL is to download a single page from the World-Wide
Web. This can be achieved by doing
rx geturl http://www.cs.latrobe.edu.au/~burton/
This will download the page specified by the URL into an appropriately
named file. For this example the file will be called
Mosaic:www.cs.latrobe.edu.au/~burton/index.html
The required directories will be created by GetURL if necessary.
Options
=======
GetURL has many command line options which allow you to do much
more interesting things. The following is a discussion of each
option individually. The names of the options can all be abbreviated
but this may be unwise as you may end up specifying a different
option from the one you intended.
Help & Setup Options
====================
-Help
e.g.
rx geturl -help
Prints a summary of all the options
-Problem
e.g.
rx geturl -problem
Allows a bug report, gripe, or problem to be entered from
the CLI.
-NewVersion <file>
e.g.
rx geturl -newversion t:geturl.rexx
Downloads a new version of GetURL from my university account (assuming
the university link hasn't gone down or something). Don't save the new
copy over the old copy until you know it has been downloaded properly.
-PatternMatching
Downloads 'Match' from my university account. Match allows GetURL to use
pattern matching in the restriction options (see the section on Match below)
-Associative
Uses a different scheme to keep the lists of URL addresses, which is quite
a bit faster.
-Delay <num>
Sets the delay between loading pages, in seconds (defaults to 2 seconds).
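e.g. (an illustrative combination of these options)
rx geturl http://www.cs.latrobe.edu.au/~burton/ -recursive -associative -delay 5
This would crawl outwards from my home page using the faster list scheme
and waiting 5 seconds between fetches.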
Control Options
===============
-Recursive
This causes GetURL to search each downloaded file for URLs and to then
fetch each one of these pages and search those. As you can guess this
will fill up your hard disc downloading every page on the planet (and
several off the planet also). In fact this is what is called a Web Robot
and should be used with due caution. Press control-c when you have had
enough. GetURL will finish only when its list of unvisited URLs is empty.
-NoProxy
Normally GetURL will try to use a proxy server if you have one set up.
You can set up a proxy server by adding something like the following to
your startnet script.
setenv WWW_HTTP_GATEWAY http://www.mira.net.au:80/
setenv WWW_FTP_GATEWAY http://www.mira.net.au:80/
Where 'www.mira.net.au' is replaced by your local proxy host and '80'
is replaced by the port number to which the proxy is connected. Proxies
are normally connected to port 80.
The NoProxy option causes GetURL to talk directly to the host in the URL
rather than asking the proxy to do so. This should only be necessary if
your proxy host isn't working properly at the moment.
-NoProxyCache
Asks the proxy host to refresh its cache if necessary. Whether the
proxy host takes any notice of this flag or not is its problem.
-Retry
GetURL keeps a list of URLs that it couldn't access. If this flag is set,
when no further URLs are available to visit, GetURL will make another
attempt to fetch each of the pages that failed before.
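e.g. (an illustrative run; the file name tmp:Failed.grl is just an example)
rx geturl http://www.cs.latrobe.edu.au/~burton/ -recursive -failed tmp:Failed.grl -retry
This keeps the list of failed URLs in tmp:Failed.grl and has another go
at them once the rest of the list is exhausted.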
Restriction Options
===================
The behaviour of each of these options depends on whether you have
'Match' installed or not. (see the section on match below)
Note: patterns are not case sensitive.
-Host <pattern>
Allows you to specify or restrict URLs to certain hosts.
e.g.
rx geturl http://www.cs.latrobe.edu.au/~burton/ -host #?.au
This will try to download any URLs connected to my home page that
are in the Australian domain '.au'
The pattern following '-host' can be any AmigaDOS filename pattern.
See the DOS Manual for a full description of these patterns but just
quickly here are a few examples
#? - means 'any sequence of characters'
(a|b) - means 'either option a or option b'
? - means 'any single character'
~a - means 'any string that does not match a'
More specifically
-host #?.au - means 'any host name ending in .au'
-host (info.cern.ch|www.w3.org) - means 'either info.cern.ch or www.w3.org'
-Path <pattern>
This works in the same manner as the -Host option except that you are
describing the pathname component of the URL
e.g.
rx geturl http://www.cs.latrobe.edu.au/~burton/ -recursive -path ~burton/NetBSD/#?
This will try to download the NetBSD documentation from my university
account.
Note: don't start the path with a leading '/' character as a '/' is
automatically inserted between the host and path parts of the URL
e.g.
rx geturl http://ftp.wustl.edu/ -recursive -path pub/aminet/#?
rx geturl http://www.cs.latrobe.edu.au/~burton/ -path #?.(gif|jpg|xbm)
-URL <pattern>
This works in the same manner as the -Host option except that you are
describing the whole URL
e.g.
rx geturl http://www.cs.latrobe.edu.au/~burton/ -recursive -url http://www.cs.latrobe.edu.au/~burton/#?
-Number <num>
Will only download <num> files across the network (not including the initial
URL supplied on the command line)
e.g.
rx geturl http://www.cs.latrobe.edu.au/~burton/ -recursive -number 5
This will download my home page and 5 other pages which are mentioned
in it.
-Length <num>
Will only download a file across the network if it is smaller than
<num> bytes
-IfModified
Will only download a file across the network if it is newer than the
file in the cache. If there is no appropriately named file in the cache
directory then the file will be downloaded anyway.
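e.g. (an illustrative combination; remember to configure the TimeZone
before using -IfModified)
rx geturl http://www.cs.latrobe.edu.au/~burton/ -recursive -length 100000 -ifmodified
This would skip anything of 100000 bytes or more and refetch only pages
that have changed since they were last cached.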
-Depth <num>
Will only download files across the network until a recursion depth of
<num> is reached. The first time a page is analyzed for URLs that page
has depth 0, the URLs found in it have depth 1. The URLs found in pages
of depth 1 have depth 2, and so on. This gradually develops into a tree
of Web pages, with the initial URL as the root (level 0), each URL found
in the root page hanging from the root (level 1), and the contents of their
pages hanging from them. This option allows GetURL to stop at the desired
depth of the tree.
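e.g. (an illustrative command line)
rx geturl http://www.cs.latrobe.edu.au/~burton/ -recursive -depth 2
This would follow links from my home page down to a depth of 2 levels
of the tree and then stop.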
Without Match
=============
The documentation above assumes that Match is installed. If Match
is not installed then AmigaDOS patterns will not work. Instead you
can use the '*' character as a place holder.
e.g.
-host *.cs.latrobe.edu.au
- meaning 'any host in the cs.latrobe.edu.au domain'
-path */index.html
- meaning 'a file index.html in any single level directory'
-url http://www.cs.*.edu/*/index.html
- meaning 'a file called index.html in a single level directory on
any WWW host in any U.S. university'
Input & Output Options
======================
-Input <filename>
Usually GetURL will start by downloading a page from the web and
saving it. This option allows you to specify a file on your hard
disc to start with.
e.g. assuming t:temp.html contains a list of URLs to search
rx geturl -input t:temp.html -r
This will search recursively starting with the specified file
-Output <filename>
Instead of saving the downloaded files into sensibly named files
under Mosaic: this option allows you to append the material downloaded
into a single file.
Mostly useful for downloading a single file.
e.g.
rx geturl http://www.cs.latrobe.edu.au/~burton/ -output James.html
-SaveRoot <dir>
If you don't want the downloaded files to be saved under Mosaic:
you can redirect them to another directory with this command
e.g.
rx geturl http://www.cs.latrobe.edu.au/~burton/ -saveroot tmp:cache
-Visited <file>
GetURL keeps a list of URLs that have already been visited. This
allows GetURL to stop itself repeating by checking each new URL
against the contents of the list of visited URLs. If you specify a file
(it may or may not exist previously) with this option - that file will
be used to check new URLs and to save URLs that have just been visited.
e.g.
rx geturl http://www.cs.latrobe.edu.au/~burton/ -visited tmp:Visited.grl
-UnVisited <file>
Normally GetURL will be used to download a starting file, search it for
URLs and then visit each URL found in turn searching each of those files
for further URLs. Each time GetURL finds a new URL it will append it to
a file so that it can come back later and download that page. This option
can be used to specify a file (it may or may not previously exist) which
will be used for this purpose. If there are already URLs in the specified
file these will be visited before any new URLs which are found.
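e.g. (an illustrative command; the file name is just an example)
rx geturl http://www.cs.latrobe.edu.au/~burton/ -recursive -unvisited tmp:ToVisit.grl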
-Failed <file>
When for some reason GetURL fails to download a file, the URL of that file
is added to a list. This option causes GetURL to use the file specified
(it may or may not exist previously) to store that list.
-SaveHeaders
When a file is retrieved using the HTTP protocol, the file is accompanied
by a header, rather like an email message. This option causes GetURL to
keep the header in a suitably named file.
e.g.
rx geturl http://www.cs.latrobe.edu.au/~burton/ -saveheaders
will save 2 files
Mosaic:www.cs.latrobe.edu.au/~burton/index.html
Mosaic:www.cs.latrobe.edu.au/~burton/index.html.HDR
{========================================================================}
{ I Robot }
{========================================================================}
The Problems with Robots
========================
GetURL is considered a 'robot' web client. This is because it can
automatically download an indeterminate number of pages or files over
the network. That is what a robot is. The power of a robot can be
abused very easily, far more easily than it can be used for reasonable
purposes. We all hate what has come to be called 'spamming' by commercial
entities on the 'net', but abuse of a robot is no different. Indiscriminate
robot use can make an entire network utterly useless for the period
during which the program is run (possibly for a great deal longer).
It is possible for a robot program to download hundreds of pages per
minute. In order to help people make the best use of GetURL without
overloading network resources and without annoying system administrators
worldwide, GetURL has some built-in rules for containing the behaviour
of the robot.
Rules for Containing the Robot
==============================
Once the robot has started off exploring the Web, it is important that
it be restrained. There are two sorts of restraints in effect: ones
that you can control via the restriction options (described above), and
some that you cannot control. Because GetURL is supplied as source code,
you are asked not to remove the restraints described here, but to
accept them. The web is a global resource with, as yet, no equivalent
of the police, so in order to leave it usable by other people you
must use self-restraint.
* Database queries
- any URL containing a question mark '?' is disallowed. This means
that GetURL will not make database queries or provide parameters
for CGI scripts. The reason is that the result of a query could,
and very often does, contain URLs for further queries, leading to an
infinite loop of queries. It is simpler to disallow outright any
URL containing a '?' than to try to distinguish the reasonable ones
(see the small sketch after this list).
* Site exclusion
- not implemented yet
* Proxies
- by default GetURL tries to use a proxy if one is configured.
This means that multiple fetches of the same page, or fetches of
pages that are commonly fetched by other users, are optimised.
* Delay
- by default GetURL waits 2 seconds between fetching files across
the network. This stops GetURL fetching lots of pages in a short
time, making its impact on the network negligible.
* Non-optimal implementation
- GetURL could be implemented in a much more efficient fashion,
both in terms of choice of compiler and choice of algorithm.
In future I would like to redesign GetURL and redevelop it in
C or a similar language, but at the moment the inefficient
implementation acts as something of a restraint in itself.
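As an illustration of the first rule, the test for a database query URL
can be as simple as the following ARexx fragment (a sketch only, not
necessarily the exact code used in the script):
/* QueryCheck.rexx - refuse URLs that look like database queries */
parse arg url
if pos('?', url) > 0 then
   say 'Refusing database query URL:' url
else
   say 'OK to fetch:' url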
More about Robots
=================
The best thing to do if you need to know more about Robots is to read
the following page with Mosaic.
http://web.nexor.co.uk/mak/doc/robots/robots.html
(World Wide Web Robots, Wanderers, and Spiders)
{========================================================================}
{ Examples }
{========================================================================}
** Examples of use of this wonderful program :-) (please send me yours)
1) rx geturl -h
print help message
2) rx geturl http://www.cs.latrobe.edu.au/~burton/ -output t:JamesBurton.html
get James' home page, save to file
3) rx geturl http://info.cern.ch/ -recursive -host info.cern.ch -visited t:visited
fetch all pages on info.cern.ch reachable from http://info.cern.ch/, keeping a list of visited URLs
4) rx geturl -directory uunews: -unvisited t:news-urls
search for URLs in all news articles, save to file, but don't visit them yet
5) rx geturl -problem
make a suggestion, ask for help, send bug report
6) rx geturl -NewVersion t:geturl.rexx
download most recent version of the script (not wise to put rexx:geturl.rexx)
7) an ADOS script called 'continue'
.key url
.bra [
.ket ]
rx geturl [url] -visited tmp:geturl/visited -unvisited tmp:geturl/unvisited -recursive -saveroot tmp:geturl/cache -failed t:geturl/failed
;; call it every time you log on, and it will fetch a few more pages. Send it ctrl-c
;; to stop. Shouldn't repeat itself. add -retry to make it attempt to retry
;; URLs it couldn't get before (those in the failed file)
8) rx geturl http://www.cs.latrobe.edu.au/~burton/PigFace-small.gif
download my portrait - just that and nothing more
9) rx geturl -PatternMatching
download Match utility
10) rx geturl http://www.cs.latrobe.edu.au -r -path #?.(gif|jpg|xbm)
download picture files in my home page
{========================================================================}
{ Match }
{========================================================================}
Match is a small utility I wrote in C and compiled with DICE v3.0.
The purpose of Match is to make AmigaDOS pattern matching available
from within scripts. I couldn't find a simple way of making ARexx do
this by itself, but I found an incredibly simple way using C. GetURL
does not require Match, but is considerably more useful with Match.
Match works as follows:
Match <pattern> <string>
e.g.
Match #? abcde
yes
Match #?.abc abcde
no
Match #?.abc abcde.abc
yes
Match prints either 'yes' or 'no' depending on whether the string matches
the pattern. In other words, if the pattern correctly describes the
string, Match will print 'yes', otherwise 'no'.
You can get hold of Match by using the following command line
rx geturl -patternmatching
Match should be installed somewhere in your shell's search path. The simplest
way to do this is to copy Match to your C: directory.
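If you want to call Match from your own ARexx scripts, the following is
a minimal sketch (the temporary file name and variable names are only
illustrative, and this is not necessarily how GetURL itself does it):
/* MatchTest.rexx - ask Match whether a string fits an AmigaDOS pattern */
pattern = '#?.html'
string  = 'index.html'
address command 'Match >T:Match.out' pattern string
answer = 'no'
if open(out, 'T:Match.out', 'R') then do
   answer = readln(out)
   call close(out)
end
if answer = 'yes' then say string 'matches' pattern
else say string 'does not match' pattern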
{========================================================================}
{ New Versions }
{========================================================================}
GetURL is still suffering from 'featuritis'. This means that every
day I think of 5 new problems that GetURL could solve if only it had
such and such an option. To make it worse, since I released GetURL onto
the network, other people have been suggesting things that I hadn't
thought of. The point is that GetURL is updated regularly. So I made
GetURL able to download a new version of itself.
If you use the command line
rx geturl -newversion t:geturl.rexx
GetURL will attempt to download the latest version of itself. New versions
may appear as often as once a week, especially if bugs or problems are
found.
Please mail me if you have any bugs, problems or suggestions. The easiest
way to do this is to use the command line
rx geturl -problem
{========================================================================}
{ Known Bugs & Problems }
{========================================================================}
** Problems, ToDos, ideas, ... (please suggest more)
1) http://www.cs.latrobe.edu.au != http://www.cs.latrobe.edu.au/
these count as separate - should be recognised as the same
2) warning if download an empty file
4) check content-length: header (add a nice display)
5) check content-type:
6) FILE://localhost - should change local_prefix temporarily
7) should delete temporary files
- including visited.tmp files
8) when using -output the WHOLE file is searched every time a page is appended
9) would be nice to spawn off HTTP processes
10) need better search engine (grep?) I tried c:search but it's no use
11) make it smaller (takes too long to compile)
12) implement missing protocols, FTP notably
13) implement -directory, -update
14) implement -date <date> (only downloads pages newer than date)
15) convince Mosaic developers to add ability to check Mosaic: cache
before downloading over the network
16) write HTTP cache server for AmiTCP (oh yeah sure!)
17) host/pathname aliases file
18) clean up an existing agenda file, removing things that are in the
prune-file and other matching URLs
19) CALL not used in function call where a value is returned
{========================================================================}
{ Glossary }
{========================================================================}
World-Wide Web (or WWW for short)
For many years computers around the world have been connected together
in what is known as the InterNet. This means that you can have real-time
as-you-speak access to somebody else's computer. The Web uses this facility
to access information on other people's computers (maybe in another country)
and to present this information to you. GetURL is a CLI or shell command
that allows you to browse the Web from your Amiga without human
intervention, i.e. automatically.
URL (Universal Resource Locator - or web-address for medium)
This is quite a complex way of saying what file you want. I won't attempt
to describe it in full but here are a few examples.
Note: GetURL probably won't understand all of these
http://host[:port][/path]
describes a file accessible via the HTTP protocol
ftp://host[:port][/path]
describes a file accessible via the FTP protocol
telnet://host[:port]
describes a telnet login
gopher://host
describes a file or directory accessible via Gopher
mailto:user@host
describes an email address
Proxy
Because millions of people are now using the Web, the network is
suffering overload symptoms. Normally HTTP specifies that pages
are downloaded directly from the place in which they are stored,
but if you connect to a proxy server, then when you go to download
a page your client (maybe GetURL or AMosaic) will ask the proxy
host for the page. The proxy host either already has the file
in its cache (in which case that saves loading it again across the network
to the proxy) or it goes and fetches it normally. Either way you end up
getting the file, and on average quite a bit quicker.
AmiNet
A group of computers which each have a set of programs for Amiga computers.
Hundreds of thousands of Amiga users download files from AmiNet every day.
Patterns
A way of describing something. Computer programs use patterns to decide
whether something is of interest.
From a shell, try
list #?.info
This will list all the files in the current directory that match the
pattern '#?.info'. See the AmigaDOS manual for more information.
Case Sensitive
This pertains to patterns. If a pattern is case sensitive then a capital
'A' will only match with another 'A' and not 'a'. If a pattern is not case
sensitive (or case insensitive) then capitals and lower-case are considered
the same letter, so 'A' matches with 'a'. Patterns in GetURL are
case-insensitive because that's how the Match program works.
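For example, since Match works case-insensitively (the file name here is
made up):
Match html#? HTML3-spec.txt
yes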
{========================================================================}
{ Changes }
{========================================================================}
History of changes
version 0.9 : initial beta release 08-Jan-95
version 1.0 : 15-Jan-95
- fixed problem in header parsing - now stops at lines containing cr 09-Jan-95
- fixed problem with name of header_file being 'HEADER_FILE' sometimes 09-Jan-95
- fixed : initial URL always added to visited file without check 09-Jan-95
- now will still run -recursive if no initial URL 09-Jan-95
- added -saveroot 09-Jan-95
- implemented FILE method to localhost 09-Jan-95
- added -number 10-Jan-95
- added -newversion 15-Jan-95
- added -failed 15-Jan-95
- added -retry 15-Jan-95
- added top of file data for HTML files only 15-Jan-95
- fixed -output 15-Jan-95
version 1.01 : 16-Jan-95
- fixed openlibrary problem 16-Jan-95 Brian Thompstone <brian@clrlight.demon.co.uk>
- fixed relative URLs problem 16-Jan-95 Brian Thompstone <brian@clrlight.demon.co.uk>
- fixed output file mode problem 16-Jan-95
version 1.02 : 26-Jan-95 (Australia Day)
- added X-Mailer header 16-Jan-95
- added AmigaDOS regexp in -host, -path 16-Jan-95
- added Match downloader 17-Jan-95
- fixed HREF & SRC missed because capitals 19-Jan-95
- fixed mailto: ending up as a relative 19-Jan-95
- wrote GetURL.doc properly 22-Jan-95
- added GetURL.doc to -NewVersion 22-Jan-95
- changed global vars to gv.* style 26-Jan-95
- added -LENGTH (thanks again Brian) 26-Jan-95
- added -SaveHeaders (thanks again Brian) 26-Jan-95
- added -IfModified 26-Jan-95
- changed problem address back to Latcs1 26-Jan-95
- added associative 26-Jan-95
- added skip of the top of HTML files so that you don't always
end up searching my home page :-) 26-Jan-95
version 1.03 : 01-Feb-95
- added -DELAY 29-Jan-95
- now refuses to work on any URLs containing a '?' 29-Jan-95
- only SRC & HREF links accepted (was accepting NAME) 29-Jan-95
- added configuration for TimeZone 29-Jan-95
- Finally, items are only added to agenda if not already there 29-Jan-95
- same for pruned, failed
- fixed a few problems with -DEPTH and implemented it for
non-array agenda 30-Jan-95
- added robot section to documentation 31-Jan-95
{========================================================================}
{ Credits }
{========================================================================}
GetURL.rexx was originally written by James Burton after a discussion with
Michael Witbrock of AMosaic fame. GetURL is public domain. Thanks to Brian
Thompstone for
(i) actually using it
(ii) actually telling me what was wrong with it
(iii) writing some fixes and additions
(iv) enthusiasm
James Burton <burton@cs.latrobe.edu.au>
Brian Thompstone <brian@clrlight.demon.co.uk>
Michael J. Witbrock <mjw@PORSCHE.BOLTZ.CS.CMU.EDU>
{========================================================================}
{ Contact }
{========================================================================}
James Burton
c/o
Department of Computer Science & Computer Engineering
Latrobe University
Bundoora, Victoria, 3083
Australia
EMail: burton@cs.latrobe.edu.au
Web: http://www.cs.latrobe.edu.au/~burton/
{========================================================================}
{ End of File 'GetURL.doc' }
{========================================================================}